import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import gaussian_kde
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score, adjusted_mutual_info_score
from yellowbrick.cluster import SilhouetteVisualizer
from plotly.offline import init_notebook_mode, iplot

# Initialize Plotly for notebook integration
# init_notebook_mode(connected=True)

# Set Plotly configuration for better embedding
# import plotly.io as pio
# pio.templates.default = "plotly"
# pio.renderers.default = "notebook_connected"

df_merged = pd.read_csv("/Users/willsigal/Desktop/Work/Shopcore/properties_with_all_metrics.csv")
Analysis
This is a demonstration of the CRE analysis I can create for acquisitions. I began by compiling a comprehensive dataset of a CRE firm’s portfolio - in this instance ShopCore, a subsidiary of Blackstone. Using web scraping techniques combined with various APIs, I gathered data points for each property. The resulting dataset combines foot-traffic analytics, detailed customer demographic information (see Persona.ai), and census data.
By integrating these data sources, I developed a classification model that goes beyond traditional industry standards such as the ICSC classification system. The model provides insights tailored specifically to ShopCore’s unique portfolio visitation and demographic profiles.
The application of this classification system is twofold:
Portfolio Insights: Rapidly gain a deeper understanding of each property’s unique market position and visitor profile within the existing portfolio.
Acquisition and Strategic Benchmarking: Efficiently evaluate potential acquisitions by comparing their profiles to existing high-performing properties. For example, the model enables rapid identification of new opportunities: if you were looking at a property in Bronzeville, Chicago, anchored by Jewel-Osco and complemented by tenants like Culver’s and Snipes, the model would classify it as a hypothetical “Type 4” property that’s 70% similar to an existing holding such as Vista Shopping Center (a minimal sketch of this lookup follows).
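As a minimal sketch of that lookup, the snippet below finds a candidate property’s closest portfolio analog by Euclidean distance in standardized feature space. This is a simplification: the report’s actual scoring (Part 4 below) uses weighted per-metric similarities, and the candidate values and feature list here are hypothetical placeholders, not values from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def closest_analog(df, candidate, feature_cols, cluster_col='Cluster'):
    """Return the portfolio property nearest to the candidate, plus its cluster label."""
    portfolio = df.dropna(subset=feature_cols).reset_index(drop=True)
    scaler = StandardScaler().fit(portfolio[feature_cols])
    X = scaler.transform(portfolio[feature_cols])
    x = scaler.transform(pd.DataFrame([candidate])[feature_cols])
    nearest = int(np.linalg.norm(X - x, axis=1).argmin())
    return portfolio.loc[nearest, 'Property Name'], portfolio.loc[nearest, cluster_col]

# Hypothetical candidate profile (illustrative numbers, not real data)
candidate = {'population': 85_000, 'median_income': 62_000, 'visits_per_sqft_2024': 18.0}
# name, cluster = closest_analog(df_merged, candidate,
#                                ['population', 'median_income', 'visits_per_sqft_2024'])
# (the 'Cluster' column is created by the clustering in Section 2)
```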
About This Report (if viewing this as an HTML file)
Navigate through the document using the table of contents on the left
Most of the plots are interactive, so feel free to explore the data
The analysis is a draft; it could be made more accurate and informative with access to more data beyond what I could find on the web
Section 1: Analyzing the Portfolio
The Top 10 Performing Centers in 2024 by Visits per Square Foot
The graph below shows the top 10 performing centers in 2024 by visits per square foot.
We can see that Plaza Pacoima, Stony Creek Promenade, and Rosewood Shopping Center are the top performers with roughly 29, 25, and 23 visits per square foot, respectively. After that, the numbers drop off to around 20 visits per square foot.
Code
# Get top 10 performing centers in 2024
top_10_2024 = df_merged.nlargest(10, 'visits_per_sqft_2024')

# Create interactive bar chart
fig = px.bar(
    top_10_2024,
    x='Property Name',
    y='visits_per_sqft_2024',
    title='Top 10 Performing Centers by Visits per Square Foot (2024)',
    labels={
        'Property Name': 'Shopping Center',
        'visits_per_sqft_2024': 'Visits per Square Foot'
    },
    hover_data={
        'Property Name': True,
        'visits_per_sqft_2024': ':.2f',
        'Overall GLA (SF)': True,
        'visits_2024': True
    }
)

# Customize the layout
fig.update_layout(
    xaxis_tickangle=-45,
    showlegend=False,
    height=600,
    margin=dict(t=50, b=100, l=50, r=50)
)

# Add hover template
fig.update_traces(
    hovertemplate="<b>%{x}</b><br>" +
                  "Visits per SF: %{y:.2f}<br>" +
                  "Total SF: %{customdata[0]:,.0f}<br>" +
                  "Total Visits: %{customdata[1]:,.0f}<br>" +
                  "<extra></extra>"
)

# Display the plot
fig.show()

# Create a professional table for top performers
display_cols = ['Property Name', 'visits_per_sqft_2024', 'Overall GLA (SF)', 'visits_2024']
styled_table = top_10_2024[display_cols].copy()

# Rename columns for display
styled_table.columns = ['Property Name', 'Visits per Sq.Ft.', 'GLA (SF)', 'Total Visits (2024)']

# Apply styling for a professional look
styled_table = styled_table.style.format({
    'Visits per Sq.Ft.': '{:.2f}',
    'GLA (SF)': '{:,.0f}',
    'Total Visits (2024)': '{:,.0f}'
}).set_caption('Top 10 Performing Centers by Visits per Square Foot (2024)').set_table_styles([
    {'selector': 'caption', 'props': [('font-weight', 'bold'), ('font-size', '1.2em'), ('text-align', 'center')]},
    {'selector': 'th', 'props': [('background-color', '#0066cc'), ('color', 'white'), ('font-weight', 'bold'), ('text-align', 'center')]},
    {'selector': 'td', 'props': [('text-align', 'center')]},
    {'selector': 'tr:nth-of-type(even)', 'props': [('background-color', '#f2f2f2')]}
]).hide(axis='index')

# Display the styled table
display(styled_table)
Table 1: Top 10 Performing Centers by Visits per Square Foot (2024)

| Property Name | Visits per Sq.Ft. | GLA (SF) | Total Visits (2024) |
|---|---|---|---|
| PLAZA PACOIMA | 29.37 | 203,743 | 5,984,549 |
| STONY CREEK PROMENADE | 25.14 | 157,647 | 3,963,282 |
| ROSEWOOD SHOPPING CENTER | 23.27 | 36,887 | 858,232 |
| VISTA SHOPPING CENTER | 20.85 | 88,699 | 1,849,305 |
| KEARNY SQUARE | 20.62 | 138,860 | 2,862,892 |
| OAK PARK VILLAGE | 20.60 | 64,659 | 1,332,072 |
| NORTHEAST TOWER CENTER | 19.13 | 301,209 | 5,763,103 |
| BARCLAY CROSSING | 18.98 | 54,958 | 1,043,286 |
| UNIVERSITY PALMS | 18.81 | 99,172 | 1,865,200 |
| STONEHILL & DEL OBISPO | 18.38 | 52,675 | 967,915 |
Looking at the Distribution of Visits per Square Foot in 2024
Instead of looking at the top 10 performing centers, I wanted to examine the distribution of visits per square foot in 2024. This will help us understand the range of performance in the portfolio. Are there outliers? And how centered is the distribution?
The mean visits per square foot is 14.2 and the median is 13.7, so the distribution is right-skewed: more properties perform below the mean than above it (a quick skewness check follows below).
We can see that Rosewood Shopping Center and Plaza Pacoima are performing above the mean, while Streets of Woodfield and Neapolitan Way are performing far below it.
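As a quick sanity check on the skew claim, the sample skewness can be computed directly; a positive value indicates a right tail:

```python
from scipy.stats import skew

vps = df_merged['visits_per_sqft_2024'].dropna()
print(f"mean={vps.mean():.2f}, median={vps.median():.2f}, skewness={skew(vps):.2f}")
```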
Code
# Create KDE plot with individual points
fig = go.Figure()

# Add KDE curve
kde_x = np.linspace(df_merged['visits_per_sqft_2024'].min(),
                    df_merged['visits_per_sqft_2024'].max(), 100)
kde = gaussian_kde(df_merged['visits_per_sqft_2024'].dropna())
kde_y = kde(kde_x)
fig.add_trace(go.Scatter(
    x=kde_x,
    y=kde_y,
    mode='lines',
    name='KDE',
    line=dict(color='blue', width=2)
))

# Add individual points
fig.add_trace(go.Scatter(
    x=df_merged['visits_per_sqft_2024'],
    y=np.zeros_like(df_merged['visits_per_sqft_2024']),  # Place points at y=0
    mode='markers',
    name='Properties',
    marker=dict(size=8, color='red', opacity=0.6),
    hovertext=df_merged['Property Name'],
    hoverinfo='text+x'
))

# Update layout
fig.update_layout(
    title='Distribution of Visits per Square Foot (2024)',
    xaxis_title='Visits per Square Foot',
    yaxis_title='Density',
    showlegend=True,
    height=600,
    hovermode='closest'
)

# Add hover template for points
fig.update_traces(
    hovertemplate="<b>%{hovertext}</b><br>" +
                  "Visits per SF: %{x:.2f}<br>" +
                  "<extra></extra>",
    selector=dict(mode='markers')
)

# Display the plot
fig.show()

# Create a professional table for summary statistics
stats_df = pd.DataFrame(df_merged['visits_per_sqft_2024'].describe()).reset_index()
stats_df.columns = ['Statistic', 'Value']
stats_df['Value'] = stats_df['Value'].round(2)

# Ensure median is included in the statistics
if 'median' not in stats_df['Statistic'].values:
    median_value = df_merged['visits_per_sqft_2024'].median()
    median_row = pd.DataFrame({'Statistic': ['median'], 'Value': [round(median_value, 2)]})
    # Insert median after mean (which is typically at index 1)
    stats_df = pd.concat([stats_df.iloc[:2], median_row, stats_df.iloc[2:]]).reset_index(drop=True)

# Create a styled HTML table
stats_html = """
<div style='max-width: 600px; margin: 20px auto;'>
  <h4 style='text-align: center; margin-bottom: 15px;'>Summary Statistics: Visits per Square Foot (2024)</h4>
  <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
    <thead>
      <tr style='background-color: #f2f2f2;'>
        <th style='padding: 10px; border: 1px solid #ddd; text-align: left;'>Statistic</th>
        <th style='padding: 10px; border: 1px solid #ddd; text-align: right;'>Value</th>
      </tr>
    </thead>
    <tbody>
"""

# Add rows to the table
for _, row in stats_df.iterrows():
    stats_html += f"""
      <tr>
        <td style='padding: 8px; border: 1px solid #ddd; text-align: left;'><b>{row['Statistic']}</b></td>
        <td style='padding: 8px; border: 1px solid #ddd; text-align: right;'>{row['Value']:,.2f}</td>
      </tr>
    """

stats_html += """
    </tbody>
  </table>
</div>
"""

# Display the HTML table
from IPython.display import HTML
display(HTML(stats_html))
Summary Statistics: Visits per Square Foot (2024)

| Statistic | Value |
|---|---|
| count | 58.00 |
| mean | 14.19 |
| median | 13.68 |
| std | 4.53 |
| min | 1.52 |
| 25% | 11.85 |
| 50% | 13.68 |
| 75% | 15.73 |
| max | 29.37 |
Looking at the Change in Visits Between 2022 and 2024
Moving from performance in 2024 to the change in performance over time, we now compare visits in 2022 to visits in 2024. By doing this, we can isolate the top and bottom performers in terms of visit growth and decline.
The change can be due to occupancy changes, market conditions, or shifts in consumer behavior. The core computation is sketched below.
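The computation behind these rankings, condensed from the report’s source (the full version also renders the growth/decline bar charts and styled tables), is a simple year-over-year difference:

```python
# Absolute and percentage change in visits between 2022 and 2024
df_merged['visits_change'] = df_merged['visits_2024'] - df_merged['visits_2022']
df_merged['visits_change_pct'] = (
    (df_merged['visits_2024'] - df_merged['visits_2022'])
    / df_merged['visits_2022'] * 100
).round(2)

# Top and bottom performers by absolute change
top_5_growth = df_merged.nlargest(5, 'visits_change')
bottom_5_decline = df_merged.nsmallest(5, 'visits_change')
```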
Analyzing the Trade Area Size
Let’s examine the distribution of our trade areas. This metric helps us understand the size of the area each shopping center serves.
We can see that trade area size is right-skewed, with a strong density of properties having trade areas around 25 sq. miles. However, a few large outliers have incredibly large trade areas (Streets of Woodfield and One Colorado, each at 100+ sq. miles).
Code
# Create KDE plot
fig = go.Figure()

# Add KDE curve
kde_x = np.linspace(df_merged['trade_area_sq_mi'].min(),
                    df_merged['trade_area_sq_mi'].max(), 100)
kde = gaussian_kde(df_merged['trade_area_sq_mi'].dropna())
kde_y = kde(kde_x)
fig.add_trace(go.Scatter(
    x=kde_x,
    y=kde_y,
    mode='lines',
    name='KDE',
    line=dict(color='blue', width=2)
))

# Add individual points
fig.add_trace(go.Scatter(
    x=df_merged['trade_area_sq_mi'],
    y=np.zeros_like(df_merged['trade_area_sq_mi']),  # Place points at y=0
    mode='markers',
    name='Properties',
    marker=dict(size=8, color='red', opacity=0.6),
    hovertext=df_merged['Property Name'],
    hoverinfo='text+x'
))

# Update layout
fig.update_layout(
    title='Distribution of Trade Area Sizes',
    xaxis_title='Trade Area Size (Square Miles)',
    yaxis_title='Density',
    showlegend=True,
    height=600,
    hovermode='closest'
)

# Add hover template for points
fig.update_traces(
    hovertemplate="<b>%{hovertext}</b><br>" +
                  "Trade Area: %{x:.2f} sq mi<br>" +
                  "<extra></extra>",
    selector=dict(mode='markers')
)

# Display the plot
fig.show()

# Create a professional table for summary statistics
stats_df = pd.DataFrame(df_merged['trade_area_sq_mi'].describe()).reset_index()
stats_df.columns = ['Statistic', 'Value']
stats_df['Value'] = stats_df['Value'].round(2)

# Ensure median is included in the statistics
if 'median' not in stats_df['Statistic'].values:
    median_value = df_merged['trade_area_sq_mi'].median()
    median_row = pd.DataFrame({'Statistic': ['median'], 'Value': [round(median_value, 2)]})
    # Insert median after mean (which is typically at index 1)
    stats_df = pd.concat([stats_df.iloc[:2], median_row, stats_df.iloc[2:]]).reset_index(drop=True)

# Build and display a styled HTML table (same pattern as above)
stats_html = """
<div style='max-width: 600px; margin: 20px auto;'>
  <h4 style='text-align: center; margin-bottom: 15px;'>Summary Statistics: Trade Area Size (Square Miles)</h4>
  <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
    <thead>
      <tr style='background-color: #f2f2f2;'>
        <th style='padding: 10px; border: 1px solid #ddd; text-align: left;'>Statistic</th>
        <th style='padding: 10px; border: 1px solid #ddd; text-align: right;'>Value</th>
      </tr>
    </thead>
    <tbody>
"""

# Add rows to the table
for _, row in stats_df.iterrows():
    stats_html += f"""
      <tr>
        <td style='padding: 8px; border: 1px solid #ddd; text-align: left;'><b>{row['Statistic']}</b></td>
        <td style='padding: 8px; border: 1px solid #ddd; text-align: right;'>{row['Value']:,.2f}</td>
      </tr>
    """

stats_html += """
    </tbody>
  </table>
</div>
"""

# Display the HTML table
from IPython.display import HTML
display(HTML(stats_html))
Summary Statistics: Trade Area Size (Square Miles)

| Statistic | Value |
|---|---|
| count | 60.00 |
| mean | 33.84 |
| median | 27.05 |
| std | 27.60 |
| min | 3.93 |
| 25% | 16.71 |
| 50% | 27.05 |
| 75% | 40.45 |
| max | 166.41 |
Section 2: Clustering Analysis to Understand the Segmentation of the Portfolio
Here we will use more advanced statistical methods to break down the portfolio into different segments based on the demographic and economic characteristics of the trade area.
First, we will reduce our data to the most important components.
Then, we will identify common characteristics unique to certain groups in our portfolio and represent them as clusters.
Part 1: Clustering Using Only Census Data
The following graphs first show a scatter plot of our properties in a reduced-dimensional space, with colors indicating the cluster each was assigned to. Then we show a heatmap of the average demographic and economic characteristics of the properties in each cluster, measured in z-scores (standard deviations from the mean). Finally, we show a summary table of each cluster's characteristics.
Analysis: With the census data alone, our properties split into only 2 groups. While the cluster summary shows strong variation between the groups (one cluster being the lower-density, higher-income, higher-education group), we might want to include more information to help break the portfolio into more groups. A condensed sketch of the census-data clustering step follows.
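The census-data clustering step, condensed from the report’s source, is sketched below; it produces the `df_merged['Cluster']` labels referenced by the comparison code further down. It standardizes the demographic variables, projects them with PCA for plotting, and picks the K-means cluster count by a vote across three internal validity metrics.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

demographic_cols = ['population', 'median_income', 'median_age', 'pct_white',
                    'pct_black', 'pct_asian', 'pct_hispanic', 'pct_college_plus',
                    'pct_homeowners', 'median_house_value',
                    'trade_area_population_density', 'trade_area_sq_mi']

# Standardize, filling gaps with column means
X = df_merged[demographic_cols].fillna(df_merged[demographic_cols].mean())
X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain 80% of the variance (used for plotting)
X_pca = PCA(n_components=0.8).fit_transform(X_scaled)

# Score candidate cluster counts with three internal validity metrics
candidates = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_scaled)
    candidates[k] = (silhouette_score(X_scaled, labels),
                     calinski_harabasz_score(X_scaled, labels),
                     davies_bouldin_score(X_scaled, labels))

# Each metric votes for a k; take the most common vote
sil_opt = max(candidates, key=lambda k: candidates[k][0])
ch_opt = max(candidates, key=lambda k: candidates[k][1])
db_opt = min(candidates, key=lambda k: candidates[k][2])
votes = [sil_opt, ch_opt, db_opt]
optimal_k = max(set(votes), key=votes.count)

df_merged['Cluster'] = KMeans(n_clusters=optimal_k, random_state=42).fit_predict(X_scaled)
```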
Code
# Select segmentation columns
seg_cols = [col for col in df_merged.columns if col.startswith('seg_')]

# Prepare data for clustering
X_seg = df_merged[seg_cols].copy()
X_seg = X_seg.fillna(X_seg.mean())  # Fill missing values with mean

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_seg)

# Import required libraries
import umap
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score, adjusted_mutual_info_score
import warnings

# Suppress UMAP warnings
warnings.filterwarnings("ignore", category=UserWarning, module="umap")

# Function to evaluate clustering
def evaluate_clustering(X, n_components_range, n_neighbors_range):
    best_silhouette = -1
    best_config = None
    best_labels = None
    best_umap = None
    for n_neighbors in n_neighbors_range:
        # Apply UMAP with explicit n_jobs parameter
        reducer = umap.UMAP(n_neighbors=n_neighbors, random_state=42, n_jobs=1)
        X_umap = reducer.fit_transform(X)
        for n_components in n_components_range:
            # Fit GMM
            gmm = GaussianMixture(n_components=n_components, random_state=42)
            labels = gmm.fit_predict(X_umap)
            # Calculate metrics
            silhouette = silhouette_score(X_umap, labels)
            if silhouette > best_silhouette:
                best_silhouette = silhouette
                best_config = (n_neighbors, n_components)
                best_labels = labels
                best_umap = X_umap
    return best_silhouette, best_config, best_labels, best_umap

# Define parameter ranges
n_components_range = range(2, 11)
n_neighbors_range = [5, 10, 15, 20, 30]

# Evaluate clustering
best_silhouette, best_config, best_labels, best_umap = evaluate_clustering(
    X_scaled, n_components_range, n_neighbors_range)

# print(f"\nBest configuration:")
# print(f"UMAP n_neighbors: {best_config[0]}")
# print(f"Number of GMM components: {best_config[1]}")
# print(f"Silhouette score: {best_silhouette:.3f}")

# Store best clustering results
df_merged['Seg_Cluster'] = best_labels

# Create UMAP visualization
fig = px.scatter(
    x=best_umap[:, 0],
    y=best_umap[:, 1],
    color=df_merged['Seg_Cluster'],
    title='Property Clusters in UMAP Space (Segmentation Variables)',
    labels={'x': 'UMAP Dimension 1', 'y': 'UMAP Dimension 2', 'color': 'Cluster'},
    hover_data={'Property Name': df_merged['Property Name']})
fig.update_layout(
    template='plotly_white',
    height=800,
    width=1200,
    showlegend=True,
    legend_title_text='Cluster')
# fig.show()

# Create heatmap of cluster characteristics with z-scores
cluster_profiles_seg = df_merged.groupby('Seg_Cluster')[seg_cols].mean()
cluster_profiles_seg_z = cluster_profiles_seg.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
fig = go.Figure(data=go.Heatmap(
    z=cluster_profiles_seg_z.values,
    x=seg_cols,
    y=[f'Cluster {i}' for i in range(best_config[1])],
    colorscale='RdBu_r',
    zmid=0,
    text=np.round(cluster_profiles_seg_z.values, 2),
    texttemplate='%{text}',
    textfont={"size": 10},
    hoverongaps=False,
    hovertemplate="<b>%{y}</b><br>" + "<b>%{x}</b>: %{z:.2f} σ<br>" + "<extra></extra>"))
fig.update_layout(
    title='Cluster Segmentation Profiles (Z-Scores)',
    xaxis_title='Segmentation Variables',
    yaxis_title='Cluster',
    height=600,
    width=700,
    template='plotly_white')
# fig.show()

# Print cluster statistics
# print("\nCluster Size Distribution:")
# print(df_merged['Seg_Cluster'].value_counts().sort_index())
# print("\nCluster Characteristics (Z-Scores):")
cluster_stats_seg_z = pd.DataFrame({
    col: (df_merged[col] - df_merged[col].mean()) / df_merged[col].std()
    for col in seg_cols})
cluster_stats_seg_z['Seg_Cluster'] = df_merged['Seg_Cluster']
cluster_stats_seg_z = cluster_stats_seg_z.groupby('Seg_Cluster').agg(['mean', 'std']).round(2)
# print(cluster_stats_seg_z)

# Compare segmentation clusters with demographic clusters
# print("\nCross-tabulation of Demographic and Segmentation Clusters:")
cross_tab = pd.crosstab(df_merged['Cluster'], df_merged['Seg_Cluster'])
# print(cross_tab)

# Calculate adjusted mutual information between the two clusterings
ami_score = adjusted_mutual_info_score(df_merged['Cluster'], df_merged['Seg_Cluster'])
# print(f"\nAdjusted Mutual Information between clusterings: {ami_score:.3f}")

# Get most variable segments
variances = cluster_profiles_seg_z.var().sort_values(ascending=False)
top_segments = variances.index.tolist()

# (Violin plots of the key segmentation variables by cluster were prototyped
# here but remain commented out in the source.)
Part 2: Clustering Using All Metrics
Here we will cluster our properties using all of the metrics we have available to us.
We will use the same clustering method as before, but now we will use all of the metrics.
The figures will be a mapping of our properties in a 3D space, a heatmap of the average values of the metrics in each cluster, and a summary table of the characteristics of each cluster along with example properties for each cluster.
Analysis: We can see that we have 4 distinct clusters. Cluster 0 has a high population, a high percentage of melting-pot families, and a relatively large trade area. Cluster 1 also has a large trade area but a lower population and a higher proportion of blue-collar and rural visitors. Cluster 2 has a moderate population, a moderate trade area, and an older demographic. Cluster 3 is highly educated and wealthy but has a relatively small trade area.
Code
# Select all relevant columns for clustering
all_metrics_cols = [
    # Demographic columns
    'population', 'median_income', 'median_age', 'pct_white', 'pct_black',
    'pct_asian', 'pct_hispanic', 'pct_college_plus', 'pct_homeowners',
    'median_house_value', 'trade_area_population_density', 'trade_area_sq_mi',
    # Visit metrics
    'visits_per_sqft_2022', 'visits_per_sqft_2023', 'visits_per_sqft_2024',
    # Segmentation columns
    *[col for col in df_merged.columns if col.startswith('seg_')]
]

# Prepare data for clustering
X_all = df_merged[all_metrics_cols].copy()
X_all = X_all.fillna(X_all.mean())  # Fill missing values with mean

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_all)

# Function to evaluate clustering
def evaluate_clustering(X, n_components_range, n_neighbors_range):
    best_silhouette = -1
    best_config = None
    best_labels = None
    best_umap = None
    best_metrics = None
    for n_neighbors in n_neighbors_range:
        # Apply UMAP
        reducer = umap.UMAP(n_neighbors=n_neighbors, random_state=42)
        X_umap = reducer.fit_transform(X)
        for n_components in n_components_range:
            # Fit GMM
            gmm = GaussianMixture(n_components=n_components, random_state=42)
            labels = gmm.fit_predict(X_umap)
            # Calculate metrics
            silhouette = silhouette_score(X_umap, labels)
            calinski = calinski_harabasz_score(X_umap, labels)
            davies = davies_bouldin_score(X_umap, labels)
            if silhouette > best_silhouette:
                best_silhouette = silhouette
                best_config = (n_neighbors, n_components)
                best_labels = labels
                best_umap = X_umap
                best_metrics = {'silhouette': silhouette, 'calinski': calinski, 'davies': davies}
    return best_silhouette, best_config, best_labels, best_umap, best_metrics

# Define parameter ranges
n_components_range = range(3, 11)
n_neighbors_range = [5, 10, 15, 20, 30]

# Evaluate clustering
best_silhouette, best_config, best_labels, best_umap, best_metrics = evaluate_clustering(
    X_scaled, n_components_range, n_neighbors_range)

# Store best clustering results
df_merged['All_Metrics_Cluster'] = best_labels

# Create 3D UMAP visualization
reducer_3d = umap.UMAP(n_neighbors=best_config[0], n_components=3, random_state=42)
umap_3d = reducer_3d.fit_transform(X_scaled)
fig = px.scatter_3d(
    x=umap_3d[:, 0], y=umap_3d[:, 1], z=umap_3d[:, 2],
    color=df_merged['All_Metrics_Cluster'],
    title='Property Clusters in 3D UMAP Space (All Metrics)',
    labels={'x': 'UMAP Dimension 1', 'y': 'UMAP Dimension 2',
            'z': 'UMAP Dimension 3', 'color': 'Cluster'},
    hover_data={'Property Name': df_merged['Property Name']})
fig.update_layout(
    template='plotly_white',
    height=600, width=700,
    showlegend=True,
    legend_title_text='Cluster',
    scene=dict(xaxis_title='UMAP Dimension 1',
               yaxis_title='UMAP Dimension 2',
               zaxis_title='UMAP Dimension 3'))
fig.write_html("umap_3d.html")
fig.show()

# Create heatmap of cluster characteristics with z-scores
cluster_profiles = df_merged.groupby('All_Metrics_Cluster')[all_metrics_cols].mean()
cluster_profiles_z = cluster_profiles.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

# Create a more focused heatmap with only the most important variables
# Select top variables by variance across clusters
top_vars = cluster_profiles_z.var(axis=0).nlargest(15).index.tolist()
focused_profiles_z = cluster_profiles_z[top_vars]
fig = go.Figure(data=go.Heatmap(
    z=focused_profiles_z.values,
    x=top_vars,
    y=[f'Cluster {i}' for i in range(best_config[1])],
    colorscale='RdBu_r',
    zmid=0,
    text=np.round(focused_profiles_z.values, 2),
    texttemplate='%{text}',
    textfont={"size": 10},
    hoverongaps=False,
    hovertemplate="<b>%{y}</b><br>" + "<b>%{x}</b>: %{z:.2f} σ<br>" + "<extra></extra>"))
fig.update_layout(
    title='Cluster Profiles - Key Differentiating Factors (Z-Scores)',
    xaxis_title='Metrics',
    yaxis_title='Cluster',
    height=600, width=700,
    template='plotly_white')
fig.show()

# Create a summary table of cluster characteristics
cluster_summary_html = """
<div style='max-width: 1000px; margin: 20px auto;'>
  <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Cluster Summary</h3>
  <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
    <thead>
      <tr style='background-color: #3498db; color: white;'>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Cluster</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Size</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Key Characteristics</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Example Properties</th>
      </tr>
    </thead>
    <tbody>
"""

for cluster_id in range(best_config[1]):
    # Get properties in this cluster
    cluster_properties = df_merged[df_merged['All_Metrics_Cluster'] == cluster_id]
    # Get top characteristics (highest z-scores)
    top_chars = cluster_profiles_z.loc[cluster_id].nlargest(5)
    top_chars_formatted = "<br>".join([f"<b>{col}</b>: {val:.2f}σ" for col, val in top_chars.items()])
    # Get example properties (random sample of 3)
    example_props = cluster_properties.sample(min(3, len(cluster_properties)))['Property Name'].tolist()
    example_props_formatted = "<br>".join(example_props)
    # Add row to table
    cluster_summary_html += f"""
      <tr style='background-color: {"#f9f9f9" if cluster_id % 2 == 0 else "white"};'>
        <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>Cluster {cluster_id}</td>
        <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{len(cluster_properties)}</td>
        <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{top_chars_formatted}</td>
        <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{example_props_formatted}</td>
      </tr>
    """

cluster_summary_html += """
    </tbody>
  </table>
</div>
"""

# Display the HTML table
from IPython.display import HTML
display(HTML(cluster_summary_html))
(Cluster summary table output; the example properties listed for one cluster include Heritage Hill, Four Points Shopping Centre, and Stonehill & Del Obispo.)
Part 4: Similarity Scores
Here we calculate similarity scores between all properties in our dataset. I weighted each category of metrics I obtained (demographic, visit, and segmentation) equally at one third and then calculated a similarity score for each property pair; it would be easy to change the weights to place more emphasis on certain metrics. The scoring scheme is summarized below.
I also provide a list of the top 10 most similar properties to each property in the dataset.
The advantage of similarity scores is that they enable quick identification of comparable properties. This is particularly useful for benchmarking new acquisitions or evaluating performance against other properties in our portfolio (see the usage sketch after the results table below).
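Concretely, the overall score for a pair of properties is the equally weighted average of the three category similarities, with each underlying metric pair mapped onto a 0-100 scale (this restates the logic of the code that follows):

$$
\mathrm{Sim}(p_1, p_2) = \tfrac{1}{3}\,S_{\mathrm{demo}} + \tfrac{1}{3}\,S_{\mathrm{visit}} + \tfrac{1}{3}\,S_{\mathrm{seg}},
\qquad
s(v_1, v_2) = 100\left(1 - \frac{|v_1 - v_2|}{\max(|v_1|, |v_2|)}\right)
$$

Percentage-valued metrics (racial shares, college attainment) are instead scored as $100 - |v_1 - v_2|$, and any comparison involving a missing value scores 0.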
Code
# Define weights for different metric categories
WEIGHTS = {
    'demographic': 0.3333,
    'visit': 0.3333,
    'segmentation': 0.3333
}

# Define sub-weights for demographic metrics
DEMOGRAPHIC_WEIGHTS = {
    'population': 0.15,
    'trade_area_sq_mi': 0.15,
    'trade_area_population_density': 0.1,
    'median_income': 0.1,
    'pct_homeowners': 0.1,
    'pct_white': 0.1,
    'pct_black': 0.1,
    'pct_asian': 0.1,
    'pct_hispanic': 0.1
}

# Define sub-weights for visit metrics
VISIT_WEIGHTS = {
    'visits_per_sqft_2024': 1
}

def calculate_metric_similarity(val1, val2, metric_type):
    """Calculate similarity between two values for a given metric type"""
    if pd.isna(val1) or pd.isna(val2):
        return 0
    # For percentage metrics, use absolute difference
    if metric_type in ['pct_white', 'pct_black', 'pct_asian', 'pct_hispanic', 'pct_college_plus']:
        return 100 - abs(val1 - val2)
    # For monetary and population metrics, use relative difference
    elif metric_type in ['population', 'median_income']:
        max_val = max(abs(val1), abs(val2))
        if max_val == 0:
            return 100
        return 100 * (1 - abs(val1 - val2) / max_val)
    # For visit metrics, use relative difference
    elif 'visits_per_sqft' in metric_type:
        max_val = max(abs(val1), abs(val2))
        if max_val == 0:
            return 100
        return 100 * (1 - abs(val1 - val2) / max_val)
    # For other metrics, use relative difference as well
    else:
        max_val = max(abs(val1), abs(val2))
        if max_val == 0:
            return 100
        return 100 * (1 - abs(val1 - val2) / max_val)

def calculate_property_similarity(prop1, prop2):
    """Calculate overall similarity between two properties"""
    # Demographic similarity
    demographic_similarity = sum(
        DEMOGRAPHIC_WEIGHTS[metric] * calculate_metric_similarity(prop1[metric], prop2[metric], metric)
        for metric in DEMOGRAPHIC_WEIGHTS.keys()
    )
    # Visit metrics similarity
    visit_similarity = sum(
        VISIT_WEIGHTS[metric] * calculate_metric_similarity(prop1[metric], prop2[metric], metric)
        for metric in VISIT_WEIGHTS.keys()
    )
    # Segmentation metrics similarity
    seg_metrics = [col for col in df_merged.columns if col.startswith('seg_')]
    seg_similarity = sum(
        calculate_metric_similarity(prop1[metric], prop2[metric], metric)
        for metric in seg_metrics
    ) / len(seg_metrics) if seg_metrics else 0
    # Calculate weighted average
    total_similarity = (
        WEIGHTS['demographic'] * demographic_similarity +
        WEIGHTS['visit'] * visit_similarity +
        WEIGHTS['segmentation'] * seg_similarity
    )
    return round(total_similarity, 2)

# Create a matrix of similarity scores
property_names = df_merged['Property Name'].tolist()
n_properties = len(property_names)
similarity_matrix = np.zeros((n_properties, n_properties))

# Calculate similarity scores for all pairs
for i in range(n_properties):
    for j in range(i + 1, n_properties):
        similarity = calculate_property_similarity(df_merged.iloc[i], df_merged.iloc[j])
        similarity_matrix[i, j] = similarity
        similarity_matrix[j, i] = similarity  # Matrix is symmetric

# Convert to DataFrame for easier manipulation
similarity_df = pd.DataFrame(similarity_matrix, index=property_names, columns=property_names)

# Create a list of all property pairs with their similarity scores
property_pairs = []
for i in range(n_properties):
    for j in range(i + 1, n_properties):
        property_pairs.append({
            'Property 1': property_names[i],
            'Property 2': property_names[j],
            'Similarity Score': similarity_matrix[i, j]
        })

# Convert to DataFrame and sort by similarity score
property_pairs_df = pd.DataFrame(property_pairs)
property_pairs_df = property_pairs_df.sort_values('Similarity Score', ascending=False)

# Create a styled HTML table for top property pairs
from IPython.display import HTML

similarity_html = """
<div style='max-width: 800px; margin: 20px auto;'>
  <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Most Similar Property Pairs</h3>
  <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
    <thead>
      <tr style='background-color: #3498db; color: white;'>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property 1</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property 2</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Similarity Score</th>
      </tr>
    </thead>
    <tbody>
"""

# Add top 10 rows to the table
for i, row in property_pairs_df.head(10).iterrows():
    similarity_html += f"""
      <tr style='background-color: {"#f9f9f9" if i % 2 == 0 else "white"};'>
        <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{row['Property 1']}</td>
        <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{row['Property 2']}</td>
        <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['Similarity Score']:.2f}</td>
      </tr>
    """

similarity_html += """
    </tbody>
  </table>
</div>
"""

# Display the HTML table
display(HTML(similarity_html))
Most Similar Property Pairs

| Property 1 | Property 2 | Similarity Score |
|---|---|---|
| SANDY PLAINS EXCHANGE | WADE GREEN VILLAGE | 76.41 |
| SOUTHGATE VILLAGE | THE MARKET AT VICTORY VILLAGE | 75.61 |
| FAIRFIELD SHOPPING CENTER | SOUTHGATE VILLAGE | 75.28 |
| BRAWLEY COMMONS | ROSEDALE SHOPPING CENTER | 75.21 |
| SPRING HILL PLAZA | WEST TOWN MARKET | 74.23 |
| PARKWAY CENTRE | SOUTHGATE VILLAGE | 73.99 |
| ROSEDALE SHOPPING CENTER | THE MARKET AT VICTORY VILLAGE | 73.91 |
| PUERTA REAL PLAZA | SPRINGBROOK PRAIRIE PAVILION | 73.85 |
| SANDY PLAINS EXCHANGE | WEST TOWN MARKET | 73.53 |
| ROSEDALE SHOPPING CENTER | SOUTHGATE VILLAGE | 73.08 |
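Finally, the same scoring function can benchmark a prospective acquisition against the whole portfolio. The candidate profile below is a hypothetical placeholder (illustrative values only); metrics it does not supply, such as the segmentation scores, are filled with NaN and contribute 0 under the scoring rules above.

```python
# Build a candidate profile covering the demographic and visit metrics; reindex
# adds the segmentation columns as NaN so calculate_property_similarity can run.
seg_cols = [col for col in df_merged.columns if col.startswith('seg_')]
candidate = pd.Series({
    'population': 92_000,                 # hypothetical trade-area population
    'trade_area_sq_mi': 24.0,
    'trade_area_population_density': 3_800,
    'median_income': 58_000,
    'pct_homeowners': 61.0,
    'pct_white': 35.0,
    'pct_black': 42.0,
    'pct_asian': 4.0,
    'pct_hispanic': 15.0,
    'visits_per_sqft_2024': 17.5,
}).reindex(list(DEMOGRAPHIC_WEIGHTS) + list(VISIT_WEIGHTS) + seg_cols)

# Score the candidate against every portfolio property and list the closest analogs
scores = df_merged.apply(lambda prop: calculate_property_similarity(candidate, prop), axis=1)
top_analogs = df_merged.assign(Similarity=scores).nlargest(5, 'Similarity')[['Property Name', 'Similarity']]
print(top_analogs)
```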
Thank You for Your Time Reviewing This!
Email: wsigal@uchicago.edu
Phone: 818-268-3572
Source Code
---title: "SHOPCORE Portfolio Analysis"author: "Will Sigal"date: last-modifiedformat: html: toc: true toc-depth: 3 toc-location: left code-fold: true code-tools: true theme: flatly fig-width: 10 fig-height: 6 fig-dpi: 300 fig-format: png embed-resources: true df-print: paged html-math-method: katex self-contained: trueexecute: warning: false message: false---```{python}import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as sns import plotly.express as pximport plotly.graph_objects as gofrom scipy.stats import gaussian_kdefrom plotly.subplots import make_subplotsfrom sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScalerfrom sklearn.decomposition import PCAfrom sklearn.cluster import KMeansfrom sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score, adjusted_mutual_info_scorefrom yellowbrick.cluster import SilhouetteVisualizerfrom plotly.offline import init_notebook_mode, iplot# Initialize Plotly for notebook integration#init_notebook_mode(connected=True)# Set Plotly configuration for better embedding#import plotly.io as pio#pio.templates.default = "plotly"#pio.renderers.default = "notebook_connected"df_merged = pd.read_csv("/Users/willsigal/Desktop/Work/Shopcore/properties_with_all_metrics.csv")```## Analysis* This was a demonstration of the CRE analysis I can create for acquistions. I began by compiling a comprehensive dataset of a CRE firm's protfolio - in this instance ShopCore, a subsidary of Blackstone. Utilizing web scraping techniques combined with various APIs, I gathered data points for each property. The resulting dataset combines foot-traffic analytics, detailed customer demographic information (see [Persona.ai](https://docs.google.com/spreadsheets/d/1RsZGcXJhDwXjrhECRNgTs917FsBncZtdL1zpBJpnMSI/edit?gid=222214515#gid=222214515)), and census data.* By integrating these sources of data, I developed a classification model that goes beyond traditional industry standards such as the ICSC classification system. The developed model provides insights tailored specifically to ShopCore's unique portfolio visitation and demographic profiles. **The application of this classification system is twofold:**1. **Portfolio Insights:** Rapidly gain a deeper understanding of each property's unique market position and visitor profile within the existing portfolio.2. **Acquisition and Strategic Benchmarking:** Efficiently evaluate potential acquisitions by comparing their profiles to existing high-performing properties. For example, the model enables rapid identification of new opportunities by comparing them to properties already in the portfolio. For example, if you were looking at a property in Bronzeville, Chicago anchored by Jewel Osco and complemented by tenants like Culvers and Snipes, the model would classify it as a hypothetical "Type 4" property thats 70% similar to one of the properties, like Vista Shopping Center.::: {.callout-note}## About This Report (if viewing this as an HTML file)1. Navigate through the document using the table of contents on the left2. Most of the plots are interactive, so feel free to explore the data3. 
The analysis is draft and could be updated to be more accurate and informative if I was able to get more data, aside from what I could find on the web:::# Section 1: Analyzing the Portfolio### the Top 10 Performing Centers in 2024 by Visits per Square Foot* The graph below shows the top 10 performing centers in 2024 by visits per square foot* We can see that Plaza Pacomia, Stony Creek, and Rosewood Shopping Center are the top performers with 29, 25, and 23 visits per square foot respectively. After that, the numbers drop off to around 20 visits per square foot```{python}# Get top 10 performing centers in 2024top_10_2024 = df_merged.nlargest(10, 'visits_per_sqft_2024')# Create interactive bar chartfig = px.bar( top_10_2024, x='Property Name', y='visits_per_sqft_2024', title='Top 10 Performing Centers by Visits per Square Foot (2024)', labels={'Property Name': 'Shopping Center','visits_per_sqft_2024': 'Visits per Square Foot' }, hover_data={'Property Name': True,'visits_per_sqft_2024': ':.2f','Overall GLA (SF)': True,'visits_2024': True })# Customize the layoutfig.update_layout( xaxis_tickangle=-45, showlegend=False, height=600, margin=dict(t=50, b=100, l=50, r=50))# Add hover templatefig.update_traces( hovertemplate="<b>%{x}</b><br>"+"Visits per SF: %{y:.2f}<br>"+"Total SF: %{customdata[0]:,.0f}<br>"+"Total Visits: %{customdata[1]:,.0f}<br>"+"<extra></extra>")# Display the plotfig.show()# Create a professional table for top performersdisplay_cols = ['Property Name', 'visits_per_sqft_2024', 'Overall GLA (SF)', 'visits_2024']styled_table = top_10_2024[display_cols].copy()# Rename columns for displaystyled_table.columns = ['Property Name', 'Visits per Sq.Ft.', 'GLA (SF)', 'Total Visits (2024)']# Apply styling for a professional lookstyled_table = styled_table.style.format({'Visits per Sq.Ft.': '{:.2f}','GLA (SF)': '{:,.0f}','Total Visits (2024)': '{:,.0f}'}).set_caption('Top 10 Performing Centers by Visits per Square Foot (2024)').set_table_styles([ {'selector': 'caption', 'props': [('font-weight', 'bold'), ('font-size', '1.2em'), ('text-align', 'center')]}, {'selector': 'th', 'props': [('background-color', '#0066cc'), ('color', 'white'), ('font-weight', 'bold'), ('text-align', 'center')]}, {'selector': 'td', 'props': [('text-align', 'center')]}, {'selector': 'tr:nth-of-type(even)', 'props': [('background-color', '#f2f2f2')]}]).hide(axis='index')# Display the styled tabledisplay(styled_table)```### Looking at the Distribution of Visits per Square Foot in 2024* Instead of looking at the top 10 performing centers, I wanted to examine the distribution of visits per square foot in 2024. This will help us understand the range of performance in the portfolio. Are there outliers? And how centered is the distribution?* The mean visits per square foot is 14.2, and the median is 13.6. 
This means that the distribution is right-skewed, and there are more properties performing below the mean than above.* We can see that Rosewood Shopping Center and Plaza Pacomia are performing above the mean, while Streets of Woodfield and Neapolitan Way are performing far below the mean.```{python}# Create KDE plot with individual pointsfig = go.Figure()# Add KDE curvekde_x = np.linspace(df_merged['visits_per_sqft_2024'].min(), df_merged['visits_per_sqft_2024'].max(), 100)kde = gaussian_kde(df_merged['visits_per_sqft_2024'].dropna())kde_y = kde(kde_x)fig.add_trace(go.Scatter( x=kde_x, y=kde_y, mode='lines', name='KDE', line=dict(color='blue', width=2)))# Add individual pointsfig.add_trace(go.Scatter( x=df_merged['visits_per_sqft_2024'], y=np.zeros_like(df_merged['visits_per_sqft_2024']), # Place points at y=0 mode='markers', name='Properties', marker=dict( size=8, color='red', opacity=0.6 ), hovertext=df_merged['Property Name'], hoverinfo='text+x'))# Update layoutfig.update_layout( title='Distribution of Visits per Square Foot (2024)', xaxis_title='Visits per Square Foot', yaxis_title='Density', showlegend=True, height=600, hovermode='closest')# Add hover template for pointsfig.update_traces( hovertemplate="<b>%{hovertext}</b><br>"+"Visits per SF: %{x:.2f}<br>"+"<extra></extra>", selector=dict(mode='markers'))# Display the plotfig.show()# Create a professional table for summary statisticsstats_df = pd.DataFrame(df_merged['visits_per_sqft_2024'].describe()).reset_index()stats_df.columns = ['Statistic', 'Value']stats_df['Value'] = stats_df['Value'].round(2)# Ensure median is included in the statisticsif'median'notin stats_df['Statistic'].values: median_value = df_merged['visits_per_sqft_2024'].median() median_row = pd.DataFrame({'Statistic': ['median'], 'Value': [round(median_value, 2)]})# Insert median after mean (which is typically at index 1) stats_df = pd.concat([stats_df.iloc[:2], median_row, stats_df.iloc[2:]]).reset_index(drop=True)# Create a styled HTML tablestats_html ="""<div style='max-width: 600px; margin: 20px auto;'> <h4 style='text-align: center; margin-bottom: 15px;'>Summary Statistics: Visits per Square Foot (2024)</h4> <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'> <thead> <tr style='background-color: #f2f2f2;'> <th style='padding: 10px; border: 1px solid #ddd; text-align: left;'>Statistic</th> <th style='padding: 10px; border: 1px solid #ddd; text-align: right;'>Value</th> </tr> </thead> <tbody>"""# Add rows to the tablefor _, row in stats_df.iterrows(): stats_html +=f""" <tr> <td style='padding: 8px; border: 1px solid #ddd; text-align: left;'><b>{row['Statistic']}</b></td> <td style='padding: 8px; border: 1px solid #ddd; text-align: right;'>{row['Value']:,.2f}</td> </tr> """stats_html +=""" </tbody> </table></div>"""# Display the HTML tablefrom IPython.display import HTMLdisplay(HTML(stats_html))```### Looking at the change in visits between 2022 and 2024.* Moving from looking a t performance in 2024 to looking at the change in performance over time, we now compare visits in 2022 to 2024. By doing this, we can isolate the top and bottom performers in terms of visit growth and decline. 
* The change can be due to occupancy changes, market conditions, or changes in the consumer's behavior.```{python}# Calculate the change in visits between 2022 and 2024df_merged['visits_change'] = df_merged['visits_2024'] - df_merged['visits_2022']df_merged['visits_change_pct'] = ((df_merged['visits_2024'] - df_merged['visits_2022']) / df_merged['visits_2022'] *100).round(2)# Get top and bottom 5 performers based on absolute changetop_5_growth = df_merged.nlargest(5, 'visits_change')bottom_5_decline = df_merged.nsmallest(5, 'visits_change')# Figure 1: Top 5 Properties by Visit Growthfig_growth = go.Figure()fig_growth.add_trace( go.Bar( x=top_5_growth['Property Name'], y=top_5_growth['visits_change'], name='Growth', marker_color='#2ecc71', hovertemplate="<b>%{x}</b><br>"+"Change in Visits: %{y:,.0f}<br>"+"2022 Visits: %{customdata[0]:,.0f}<br>"+"2024 Visits: %{customdata[1]:,.0f}<br>"+"Growth: %{customdata[2]}%<br>"+"<extra></extra>", customdata=np.column_stack(( top_5_growth['visits_2022'], top_5_growth['visits_2024'], top_5_growth['visits_change_pct'] )) ))fig_growth.update_layout( title=dict( text='Top 5 Properties by Visit Growth (2022-2024)', x=0.5, y=0.95, xanchor='center', yanchor='top', font=dict(size=18) ), height=500, width=700, showlegend=False, template='plotly_white', margin=dict(t=100, b=100, l=80, r=50))fig_growth.update_xaxes(tickangle=-45, title_text="Property Name", title_standoff=15)fig_growth.update_yaxes(title_text="Change in Number of Visits", title_standoff=15)# Display the growth plotfig_growth.show()# Figure 2: Top 5 Properties by Visit Declinefig_decline = go.Figure()fig_decline.add_trace( go.Bar( x=bottom_5_decline['Property Name'], y=bottom_5_decline['visits_change'], name='Decline', marker_color='#e74c3c', hovertemplate="<b>%{x}</b><br>"+"Change in Visits: %{y:,.0f}<br>"+"2022 Visits: %{customdata[0]:,.0f}<br>"+"2024 Visits: %{customdata[1]:,.0f}<br>"+"Decline: %{customdata[2]}%<br>"+"<extra></extra>", customdata=np.column_stack(( bottom_5_decline['visits_2022'], bottom_5_decline['visits_2024'], bottom_5_decline['visits_change_pct'] )) ))fig_decline.update_layout( title=dict( text='Top 5 Properties by Visit Decline (2022-2024)', x=0.5, y=0.95, xanchor='center', yanchor='top', font=dict(size=18) ), height=500, width=700, showlegend=False, template='plotly_white', margin=dict(t=100, b=100, l=80, r=50))fig_decline.update_xaxes(tickangle=-45, title_text="Property Name", title_standoff=15)fig_decline.update_yaxes(title_text="Change in Number of Visits", title_standoff=15)# Display the decline plotfig_decline.show()# Create styled HTML tables for visit change datafrom IPython.display import HTML# Format top growth properties tabletop_growth_html ="""<div style='max-width: 800px; margin: 20px auto;'> <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Top 5 Properties by Visit Growth (2022-2024)</h3> <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'> <thead> <tr style='background-color: #3498db; color: white;'> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property Name</th> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>2022 Visits</th> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>2024 Visits</th> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Change</th> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>% Change</th> </tr> </thead> <tbody>"""for _, row in top_5_growth.iterrows(): top_growth_html 
+=f""" <tr style='background-color: #f9f9f9;'> <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['Property Name']}</td> <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['visits_2022']:,.0f}</td> <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['visits_2024']:,.0f}</td> <td style='padding: 8px; text-align: center; border: 1px solid #ddd; color: green;'>{row['visits_change']:+,.0f}</td> <td style='padding: 8px; text-align: center; border: 1px solid #ddd; color: green;'>{row['visits_change_pct']:+.1f}%</td> </tr> """top_growth_html +=""" </tbody> </table></div>"""# Format bottom decline properties tablebottom_decline_html ="""<div style='max-width: 800px; margin: 20px auto;'> <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Top 5 Properties by Visit Decline (2022-2024)</h3> <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'> <thead> <tr style='background-color: #e74c3c; color: white;'> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property Name</th> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>2022 Visits</th> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>2024 Visits</th> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Change</th> <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>% Change</th> </tr> </thead> <tbody>"""for _, row in bottom_5_decline.iterrows(): bottom_decline_html +=f""" <tr style='background-color: #f9f9f9;'> <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['Property Name']}</td> <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['visits_2022']:,.0f}</td> <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['visits_2024']:,.0f}</td> <td style='padding: 8px; text-align: center; border: 1px solid #ddd; color: red;'>{row['visits_change']:+,.0f}</td> <td style='padding: 8px; text-align: center; border: 1px solid #ddd; color: red;'>{row['visits_change_pct']:+.1f}%</td> </tr> """bottom_decline_html +=""" </tbody> </table></div>"""# Format overall statistics tableoverall_stats_html =f"""<div style='max-width: 800px; margin: 20px auto;'> <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Overall Visit Change Statistics (2022-2024)</h3> <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'> <tbody> <tr style='background-color: #f2f2f2;'> <td style='padding: 10px; text-align: center; border: 1px solid #ddd; font-weight: bold;'>Average Change</td> <td style='padding: 10px; text-align: center; border: 1px solid #ddd;'>{df_merged['visits_change'].mean():,.0f} visits</td> </tr> <tr> <td style='padding: 10px; text-align: center; border: 1px solid #ddd; font-weight: bold;'>Median Change</td> <td style='padding: 10px; text-align: center; border: 1px solid #ddd;'>{df_merged['visits_change'].median():,.0f} visits</td> </tr> <tr style='background-color: #f2f2f2;'> <td style='padding: 10px; text-align: center; border: 1px solid #ddd; font-weight: bold;'>Properties with Growth</td> <td style='padding: 10px; text-align: center; border: 1px solid #ddd;'>{(df_merged['visits_change'] >0).sum():,}</td> </tr> <tr> <td style='padding: 10px; text-align: center; border: 1px solid #ddd; font-weight: bold;'>Properties with Decline</td> <td style='padding: 10px; text-align: center; border: 1px solid #ddd;'>{(df_merged['visits_change'] 
<0).sum():,}</td> </tr> </tbody> </table></div>"""# Display all tablesdisplay(HTML(top_growth_html + bottom_decline_html + overall_stats_html))```### Analyzing the Trade Area Size* Let's examine the distribution of our trade areas. This metric helps us understand the size of the area that our shopping center serves.* We can see that the trade area size is right-skewed, with a strong density of properties having trade areas around 25 sq. miles. However, we have a few large outliers with incredibly large trade areas (Streets of Woodfield and One Colorado with +100 sq. mile trade areas).```{python}# Create KDE plot fig = go.Figure()# Add KDE curvekde_x = np.linspace(df_merged['trade_area_sq_mi'].min(), df_merged['trade_area_sq_mi'].max(), 100)kde = gaussian_kde(df_merged['trade_area_sq_mi'].dropna())kde_y = kde(kde_x)fig.add_trace(go.Scatter( x=kde_x, y=kde_y, mode='lines', name='KDE', line=dict(color='blue', width=2)))# Add individual pointsfig.add_trace(go.Scatter( x=df_merged['trade_area_sq_mi'], y=np.zeros_like(df_merged['trade_area_sq_mi']), # Place points at y=0 mode='markers', name='Properties', marker=dict( size=8, color='red', opacity=0.6 ), hovertext=df_merged['Property Name'], hoverinfo='text+x'))# Update layoutfig.update_layout( title='Distribution of Trade Area Sizes', xaxis_title='Trade Area Size (Square Miles)', yaxis_title='Density', showlegend=True, height=600, hovermode='closest')# Add hover template for pointsfig.update_traces( hovertemplate="<b>%{hovertext}</b><br>"+"Trade Area: %{x:.2f} sq mi<br>"+"<extra></extra>", selector=dict(mode='markers'))# Display the plotfig.show()# Create a professional table for summary statisticsstats_df = pd.DataFrame(df_merged['trade_area_sq_mi'].describe()).reset_index()stats_df.columns = ['Statistic', 'Value']stats_df['Value'] = stats_df['Value'].round(2)# Ensure median is included in the statisticsif'median'notin stats_df['Statistic'].values: median_value = df_merged['trade_area_sq_mi'].median() median_row = pd.DataFrame({'Statistic': ['median'], 'Value': [round(median_value, 2)]})# Insert median after mean (which is typically at index 1) stats_df = pd.concat([stats_df.iloc[:2], median_row, stats_df.iloc[2:]]).reset_index(drop=True)# Create a styled HTML tablestats_html ="""<div style='max-width: 600px; margin: 20px auto;'> <h4 style='text-align: center; margin-bottom: 15px;'>Summary Statistics: Trade Area Size (Square Miles)</h4> <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'> <thead> <tr style='background-color: #f2f2f2;'> <th style='padding: 10px; border: 1px solid #ddd; text-align: left;'>Statistic</th> <th style='padding: 10px; border: 1px solid #ddd; text-align: right;'>Value</th> </tr> </thead> <tbody>"""# Add rows to the tablefor _, row in stats_df.iterrows(): stats_html +=f""" <tr> <td style='padding: 8px; border: 1px solid #ddd; text-align: left;'><b>{row['Statistic']}</b></td> <td style='padding: 8px; border: 1px solid #ddd; text-align: right;'>{row['Value']:,.2f}</td> </tr> """stats_html +=""" </tbody> </table></div>"""# Display the HTML tablefrom IPython.display import HTMLdisplay(HTML(stats_html))```# Section 2: Clustering Analysis to Understand the Segmentation of the Portfolio* Here we will use more advanced statistical methods to break down the portfolio into different segments based on the demographic and economic characteristics of the trade area.* First, we will reduce our data to the most important components.* Then, we will identify common characteristics unique to certain 
groups in our portfolio and represent them as clusters.### Part 1: Clustering Using Only Census Data* The following graphs first show a scatter plot of our properties in a reduced dimensional space, with colors indicating the cluster they were assigned to. Then we show a heatmap of the average values of the demographic and economic characteristics of the properties in each cluster - this is measured by z-scores (standard deviations from the mean). Finally, we show a summary table of the characteristics of each cluster.* **Analysis:** We can see that with the census data, our properties were only split into 2 groups. While we can see the cluster summary shows that there is strong variation between the groups (with one cluster being the lower density, higher income, and higher education group), we might want to include more information to help break down our portfolio into more groups.```{python}# Select demographic columnsdemographic_cols = ['population', 'median_income', 'median_age', 'pct_white', 'pct_black', 'pct_asian', 'pct_hispanic', 'pct_college_plus', 'pct_homeowners', 'median_house_value', 'trade_area_population_density', 'trade_area_sq_mi']# Prepare data for clusteringX = df_merged[demographic_cols].copy()X = X.fillna(X.mean()) # Fill missing values with meanscaler = StandardScaler()X_scaled = scaler.fit_transform(X)# Determine optimal number of PCA componentspca_full = PCA()pca_full.fit(X_scaled)cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)# Find optimal number of components using 80% culumative varn_components_80 = np.argmax(cumulative_variance >=0.8) +1# Perform PCA with optimal componentspca = PCA(n_components=n_components_80)X_pca = pca.fit_transform(X_scaled)# Determine optimal number of clusters using multiple metricsK =range(2, 11)metrics = {'Silhouette': [],'Calinski-Harabasz': [],'Davies-Bouldin': []}for k in K: kmeans = KMeans(n_clusters=k, random_state=42) labels = kmeans.fit_predict(X_scaled)# Calculate metrics metrics['Silhouette'].append(silhouette_score(X_scaled, labels)) metrics['Calinski-Harabasz'].append(calinski_harabasz_score(X_scaled, labels)) metrics['Davies-Bouldin'].append(davies_bouldin_score(X_scaled, labels))# Find optimal nmber of clusters using combined metricssilhouette_opt = K[np.argmax(metrics['Silhouette'])]calinski_opt = K[np.argmax(metrics['Calinski-Harabasz'])]davies_opt = K[np.argmin(metrics['Davies-Bouldin'])]# Use the most common optimalkoptimal_k =max(set([silhouette_opt, calinski_opt, davies_opt]), key=[silhouette_opt, calinski_opt, davies_opt].count)# Perform K-means clustering with optimal number of clusterskmeans = KMeans(n_clusters=optimal_k, random_state=42)df_merged['Cluster'] = kmeans.fit_predict(X_scaled)# Create visualization of clusters in PCA spacefig = px.scatter( x=X_pca[:, 0], y=X_pca[:, 1], color=df_merged['Cluster'].astype(str), title='Property Clusters in PCA Space', labels={'x': f'First Principal Component ({pca.explained_variance_ratio_[0]:.1%} variance)','y': f'Second Principal Component ({pca.explained_variance_ratio_[1]:.1%} variance)','color': 'Cluster'}, hover_data={'Property': df_merged['Property Name']})fig.update_layout( template='plotly_white', height=600, width=700, showlegend=True, legend_title_text='Cluster', font=dict(family="Arial, sans-serif", size=12), margin=dict(l=50, r=50, t=80, b=50))fig.show()# Create cluster profile visualizationcluster_profiles = df_merged.groupby('Cluster')[demographic_cols].mean()# Calculate z-scores for each demographic variable across 
clusterscluster_profiles_z = cluster_profiles.apply(lambda x: (x - x.mean()) / x.std(), axis=0)fig = go.Figure(data=go.Heatmap( z=cluster_profiles_z.values, x=[col.replace('_', ' ').title() for col in demographic_cols], y=[f'Cluster {i}'for i inrange(optimal_k)], colorscale='RdBu_r', zmid=0, text=np.round(cluster_profiles_z.values, 2), texttemplate='%{text}', textfont={"size": 10}, hoverongaps=False, hovertemplate="<b>%{y}</b><br>"+"<b>%{x}</b>: %{z:.2f} σ<br>"+"<extra></extra>"))fig.update_layout( title='Cluster Demographic Profiles (Z-Scores)', xaxis_title='Demographic Variables', yaxis_title='Cluster', height=600, width=700, template='plotly_white', font=dict(family="Arial, sans-serif", size=12), margin=dict(l=50, r=50, t=80, b=50))fig.show()# Create a summary table of cluster characteristicscluster_summary = pd.DataFrame()for i inrange(optimal_k): cluster_properties = df_merged[df_merged['Cluster'] == i] cluster_summary = pd.concat([cluster_summary, pd.DataFrame({'Cluster': [f'Cluster {i}'],'Size': [len(cluster_properties)],'Avg Visits 2024': [cluster_properties['visits_2024'].mean()],'Avg Visits/SF': [cluster_properties['visits_per_sqft_2024'].mean()],'Median Income': [cluster_properties['median_income'].mean()],'Population': [cluster_properties['population'].mean()],'College Education %': [cluster_properties['pct_college_plus'].mean()],'Median House Value': [cluster_properties['median_house_value'].mean()] })])# Display formatted cluster summarycluster_summary_html ="""<div style='max-width: 1000px; margin: 20px auto;'> <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Cluster Summary</h3> <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'> <thead> <tr style='background-color: #f2f2f2;'> <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Cluster</th> <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Size</th> <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Avg Visits 2024</th> <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Avg Visits/SF</th> <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Median Income</th> <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Population</th> <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>College Education %</th> <th style='padding: 10px; border: 1px solid #ddd; text-align: center;'>Median House Value</th> </tr> </thead> <tbody>"""for _, row in cluster_summary.iterrows(): cluster_summary_html +=f""" <tr> <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'><b>{row['Cluster']}</b></td> <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['Size']:,.0f}</td> <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['Avg Visits 2024']:,.0f}</td> <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['Avg Visits/SF']:.2f}</td> <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>${row['Median Income']:,.0f}</td> <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['Population']:,.0f}</td> <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>{row['College Education %']:.1f}%</td> <td style='padding: 8px; border: 1px solid #ddd; text-align: center;'>${row['Median House Value']:,.0f}</td> </tr> """cluster_summary_html +=""" </tbody> </table></div>"""display(HTML(cluster_summary_html))``````{python}# Select segmentation 
```{python}
# Select segmentation columns
seg_cols = [col for col in df_merged.columns if col.startswith('seg_')]

# Prepare data for clustering
X_seg = df_merged[seg_cols].copy()
X_seg = X_seg.fillna(X_seg.mean())  # Fill missing values with column means

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_seg)

# Import required libraries
import umap
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_mutual_info_score)
import warnings

# Suppress UMAP warnings
warnings.filterwarnings("ignore", category=UserWarning, module="umap")

# Search UMAP/GMM configurations and keep the one with the best silhouette score
def evaluate_clustering(X, n_components_range, n_neighbors_range):
    best_silhouette = -1
    best_config = None
    best_labels = None
    best_umap = None
    for n_neighbors in n_neighbors_range:
        # Apply UMAP with an explicit n_jobs parameter for reproducibility
        reducer = umap.UMAP(n_neighbors=n_neighbors, random_state=42, n_jobs=1)
        X_umap = reducer.fit_transform(X)
        for n_components in n_components_range:
            # Fit a Gaussian mixture model in the embedded space
            gmm = GaussianMixture(n_components=n_components, random_state=42)
            labels = gmm.fit_predict(X_umap)
            # Score the clustering on the embedding
            silhouette = silhouette_score(X_umap, labels)
            if silhouette > best_silhouette:
                best_silhouette = silhouette
                best_config = (n_neighbors, n_components)
                best_labels = labels
                best_umap = X_umap
    return best_silhouette, best_config, best_labels, best_umap

# Define parameter ranges
n_components_range = range(2, 11)
n_neighbors_range = [5, 10, 15, 20, 30]

# Evaluate clustering
best_silhouette, best_config, best_labels, best_umap = evaluate_clustering(
    X_scaled, n_components_range, n_neighbors_range)

# print(f"\nBest configuration:")
# print(f"UMAP n_neighbors: {best_config[0]}")
# print(f"Number of GMM components: {best_config[1]}")
# print(f"Silhouette score: {best_silhouette:.3f}")

# Store best clustering results
df_merged['Seg_Cluster'] = best_labels

# Create UMAP visualization
fig = px.scatter(
    x=best_umap[:, 0],
    y=best_umap[:, 1],
    color=df_merged['Seg_Cluster'],
    title='Property Clusters in UMAP Space (Segmentation Variables)',
    labels={'x': 'UMAP Dimension 1', 'y': 'UMAP Dimension 2', 'color': 'Cluster'},
    hover_data={'Property Name': df_merged['Property Name']}
)
fig.update_layout(
    template='plotly_white',
    height=800,
    width=1200,
    showlegend=True,
    legend_title_text='Cluster'
)
# fig.show()

# Create heatmap of cluster characteristics with z-scores
cluster_profiles_seg = df_merged.groupby('Seg_Cluster')[seg_cols].mean()
cluster_profiles_seg_z = cluster_profiles_seg.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

fig = go.Figure(data=go.Heatmap(
    z=cluster_profiles_seg_z.values,
    x=seg_cols,
    y=[f'Cluster {i}' for i in range(best_config[1])],
    colorscale='RdBu_r',
    zmid=0,
    text=np.round(cluster_profiles_seg_z.values, 2),
    texttemplate='%{text}',
    textfont={"size": 10},
    hoverongaps=False,
    hovertemplate="<b>%{y}</b><br>" + "<b>%{x}</b>: %{z:.2f} σ<br>" + "<extra></extra>"
))
fig.update_layout(
    title='Cluster Segmentation Profiles (Z-Scores)',
    xaxis_title='Segmentation Variables',
    yaxis_title='Cluster',
    height=600,
    width=700,
    template='plotly_white'
)
# fig.show()

# Print cluster statistics
# print("\nCluster Size Distribution:")
# print(df_merged['Seg_Cluster'].value_counts().sort_index())
# print("\nCluster Characteristics (Z-Scores):")
cluster_stats_seg_z = pd.DataFrame({
    col: (df_merged[col] - df_merged[col].mean()) / df_merged[col].std()
    for col in seg_cols
})
cluster_stats_seg_z['Seg_Cluster'] = df_merged['Seg_Cluster']
cluster_stats_seg_z = cluster_stats_seg_z.groupby('Seg_Cluster').agg(['mean', 'std']).round(2)
# print(cluster_stats_seg_z)

# Compare segmentation clusters with demographic clusters
# print("\nCross-tabulation of Demographic and Segmentation Clusters:")
cross_tab = pd.crosstab(df_merged['Cluster'], df_merged['Seg_Cluster'])
# print(cross_tab)

# Calculate adjusted mutual information between the two clusterings
ami_score = adjusted_mutual_info_score(df_merged['Cluster'], df_merged['Seg_Cluster'])
# print(f"\nAdjusted Mutual Information between clusterings: {ami_score:.3f}")

# Create violin plots for key segmentation variables (disabled)
# fig = make_subplots(
#     rows=2, cols=2,
#     subplot_titles=('Top 4 Most Variable Segments by Cluster',
#                     'Next 4 Most Variable Segments by Cluster',
#                     'Next 4 Most Variable Segments by Cluster',
#                     'Remaining Segments by Cluster'),
#     specs=[[{"type": "violin"}, {"type": "violin"}],
#            [{"type": "violin"}, {"type": "violin"}]])

# Get most variable segments
variances = cluster_profiles_seg_z.var().sort_values(ascending=False)
top_segments = variances.index.tolist()

# Plot segments in groups of 4 (disabled)
# for i in range(4):
#     row = i // 2 + 1
#     col = i % 2 + 1
#     start_idx = i * 4
#     end_idx = min(start_idx + 4, len(top_segments))
#     for segment in top_segments[start_idx:end_idx]:
#         fig.add_trace(
#             go.Violin(
#                 y=df_merged[segment],
#                 x=df_merged['Seg_Cluster'],
#                 name=segment,
#                 box_visible=True,
#                 meanline_visible=True,
#                 points="outliers"
#             ),
#             row=row, col=col
#         )
# fig.update_layout(
#     title='Distribution of Key Segmentation Variables by Cluster',
#     height=1000,
#     width=1200,
#     showlegend=True,
#     template='plotly_white'
# )
# fig.show()
```
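To check how much the census-based clusters and the segmentation-based clusters agree, the block above computes their adjusted mutual information (AMI), which corrects raw mutual information for chance agreement: 1.0 means identical partitions up to relabeling, while values near 0 mean the two clusterings are essentially independent. A minimal standalone illustration on toy labels (not portfolio data):

```{python}
from sklearn.metrics import adjusted_mutual_info_score

# The same grouping under different label names still scores 1.0,
# since AMI is invariant to how clusters are numbered
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]
print(adjusted_mutual_info_score(labels_a, labels_b))  # 1.0

# A partition that cuts across labels_a's groups scores near 0
# (AMI can even dip slightly below 0)
labels_c = [0, 1, 0, 1, 0, 1]
print(adjusted_mutual_info_score(labels_a, labels_c))
```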
### Part 2: Clustering Using All Metrics

* Here we cluster our properties using all of the metrics available to us.
* We use the same UMAP + Gaussian mixture approach as before, now applied to the combined demographic, visit, and segmentation features.
* The figures show a mapping of our properties in 3D space, a heatmap of the average metric values in each cluster, and a summary table of each cluster's characteristics along with example properties.
* **Analysis:** We get 4 distinct clusters. Cluster 0 has a high population, a high percentage of melting-pot families, and a relatively large trade area. Cluster 1 also has a large trade area but a lower population and a higher proportion of blue-collar and rural visitors. Cluster 2 has a moderate population, a moderate trade area, and an older demographic. Cluster 3 is highly educated and wealthy but has a relatively small trade area.
```{python}
# Select all relevant columns for clustering
all_metrics_cols = [
    # Demographic columns
    'population', 'median_income', 'median_age', 'pct_white', 'pct_black',
    'pct_asian', 'pct_hispanic', 'pct_college_plus', 'pct_homeowners',
    'median_house_value', 'trade_area_population_density', 'trade_area_sq_mi',
    # Visit metrics
    'visits_per_sqft_2022', 'visits_per_sqft_2023', 'visits_per_sqft_2024',
    # Segmentation columns
    *[col for col in df_merged.columns if col.startswith('seg_')]
]

# Prepare data for clustering
X_all = df_merged[all_metrics_cols].copy()
X_all = X_all.fillna(X_all.mean())  # Fill missing values with column means

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_all)

# Search UMAP/GMM configurations as before, now tracking additional validity metrics
def evaluate_clustering(X, n_components_range, n_neighbors_range):
    best_silhouette = -1
    best_config = None
    best_labels = None
    best_umap = None
    best_metrics = None
    for n_neighbors in n_neighbors_range:
        # Apply UMAP
        reducer = umap.UMAP(n_neighbors=n_neighbors, random_state=42)
        X_umap = reducer.fit_transform(X)
        for n_components in n_components_range:
            # Fit GMM
            gmm = GaussianMixture(n_components=n_components, random_state=42)
            labels = gmm.fit_predict(X_umap)
            # Calculate metrics
            silhouette = silhouette_score(X_umap, labels)
            calinski = calinski_harabasz_score(X_umap, labels)
            davies = davies_bouldin_score(X_umap, labels)
            if silhouette > best_silhouette:
                best_silhouette = silhouette
                best_config = (n_neighbors, n_components)
                best_labels = labels
                best_umap = X_umap
                best_metrics = {'silhouette': silhouette, 'calinski': calinski, 'davies': davies}
    return best_silhouette, best_config, best_labels, best_umap, best_metrics

# Define parameter ranges
n_components_range = range(3, 11)
n_neighbors_range = [5, 10, 15, 20, 30]

# Evaluate clustering
best_silhouette, best_config, best_labels, best_umap, best_metrics = evaluate_clustering(
    X_scaled, n_components_range, n_neighbors_range)

# Store best clustering results
df_merged['All_Metrics_Cluster'] = best_labels

# Create 3D UMAP visualization
reducer_3d = umap.UMAP(n_neighbors=best_config[0], n_components=3, random_state=42)
umap_3d = reducer_3d.fit_transform(X_scaled)

fig = px.scatter_3d(
    x=umap_3d[:, 0],
    y=umap_3d[:, 1],
    z=umap_3d[:, 2],
    color=df_merged['All_Metrics_Cluster'],
    title='Property Clusters in 3D UMAP Space (All Metrics)',
    labels={'x': 'UMAP Dimension 1', 'y': 'UMAP Dimension 2',
            'z': 'UMAP Dimension 3', 'color': 'Cluster'},
    hover_data={'Property Name': df_merged['Property Name']}
)
fig.update_layout(
    template='plotly_white',
    height=600,
    width=700,
    showlegend=True,
    legend_title_text='Cluster',
    scene=dict(
        xaxis_title='UMAP Dimension 1',
        yaxis_title='UMAP Dimension 2',
        zaxis_title='UMAP Dimension 3'
    )
)
fig.write_html("umap_3d.html")
fig.show()

# Create heatmap of cluster characteristics with z-scores
cluster_profiles = df_merged.groupby('All_Metrics_Cluster')[all_metrics_cols].mean()
cluster_profiles_z = cluster_profiles.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

# Create a more focused heatmap with only the most important variables:
# select the top variables by variance across clusters
top_vars = cluster_profiles_z.var(axis=0).nlargest(15).index.tolist()
focused_profiles_z = cluster_profiles_z[top_vars]

fig = go.Figure(data=go.Heatmap(
    z=focused_profiles_z.values,
    x=top_vars,
    y=[f'Cluster {i}' for i in range(best_config[1])],
    colorscale='RdBu_r',
    zmid=0,
    text=np.round(focused_profiles_z.values, 2),
    texttemplate='%{text}',
    textfont={"size": 10},
    hoverongaps=False,
    hovertemplate="<b>%{y}</b><br>" + "<b>%{x}</b>: %{z:.2f} σ<br>" + "<extra></extra>"
))
fig.update_layout(
    title='Cluster Profiles - Key Differentiating Factors (Z-Scores)',
    xaxis_title='Metrics',
    yaxis_title='Cluster',
    height=600,
    width=700,
    template='plotly_white'
)
fig.show()

# Build a summary table of cluster characteristics
cluster_summary_html = """
<div style='max-width: 1000px; margin: 20px auto;'>
  <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Cluster Summary</h3>
  <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
    <thead>
      <tr style='background-color: #3498db; color: white;'>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Cluster</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Size</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Key Characteristics</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Example Properties</th>
      </tr>
    </thead>
    <tbody>
"""
for cluster_id in range(best_config[1]):
    # Get properties in this cluster
    cluster_properties = df_merged[df_merged['All_Metrics_Cluster'] == cluster_id]
    # Get top characteristics (highest z-scores)
    top_chars = cluster_profiles_z.loc[cluster_id].nlargest(5)
    top_chars_formatted = "<br>".join([f"<b>{col}</b>: {val:.2f}σ" for col, val in top_chars.items()])
    # Get example properties (random sample of up to 3)
    example_props = cluster_properties.sample(min(3, len(cluster_properties)))['Property Name'].tolist()
    example_props_formatted = "<br>".join(example_props)
    # Add a row to the table, alternating the background color
    bg = "#f9f9f9" if cluster_id % 2 == 0 else "white"
    cluster_summary_html += f"""
      <tr style='background-color: {bg};'>
        <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>Cluster {cluster_id}</td>
        <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{len(cluster_properties)}</td>
        <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{top_chars_formatted}</td>
        <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{example_props_formatted}</td>
      </tr>
"""
cluster_summary_html += """
    </tbody>
  </table>
</div>
"""
# Display the HTML table
from IPython.display import HTML
display(HTML(cluster_summary_html))
```
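A natural extension is assigning a prospective acquisition to one of these clusters. The search above keeps only the labels, so the sketch below refits the winning UMAP/GMM configuration and retains the fitted objects; `new_property_metrics` and `classify_property` are hypothetical names, and the input is assumed to be a one-row DataFrame with the same columns as `all_metrics_cols` and no missing values. This is an illustration of the approach, not code from the original run.

```{python}
# Refit the winning configuration so the fitted objects are available.
# With the same parameters and random_state this should reproduce the
# clustering found by the search above.
reducer = umap.UMAP(n_neighbors=best_config[0], random_state=42)
embedding = reducer.fit_transform(X_scaled)
gmm = GaussianMixture(n_components=best_config[1], random_state=42).fit(embedding)

def classify_property(new_property_metrics):
    """Assign a prospective property to one of the portfolio clusters.

    Hypothetical helper: expects a one-row DataFrame with the same
    columns as all_metrics_cols and no missing values.
    """
    x = scaler.transform(new_property_metrics[all_metrics_cols])
    x_umap = reducer.transform(x)   # project into the fitted UMAP space
    return gmm.predict(x_umap)[0]   # cluster ("type") label
```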
### Part 3: Similarity Scores

* Here we calculate similarity scores between every pair of properties in the dataset. Each category of metric I obtained (demographic, visit, and segmentation) carries a 1/3 weight, and the weighted scores are combined into a single 0-100 similarity for each property pair. It would be easy to change the weights to place more emphasis on certain metrics (a sketch of this appears at the end of the code below).
* I also provide a list of the top 10 most similar properties to each property in the dataset.
* The advantage of similarity scores is that they enable quick identification of comparable properties. This is particularly useful for benchmarking new acquisitions or evaluating performance against other properties in our portfolio.
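Concretely, for most metrics the pairwise similarity is a relative-difference score scaled to 0-100,

$$ \text{sim}(v_1, v_2) = 100\left(1 - \frac{|v_1 - v_2|}{\max(|v_1|,\, |v_2|)}\right), $$

while percentage-share metrics use the simpler $100 - |v_1 - v_2|$, since they already live on a 0-100 scale. Category scores are then blended with the weights defined below.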
```{python}
# Define weights for the three metric categories (equal thirds)
WEIGHTS = {
    'demographic': 0.3333,
    'visit': 0.3333,
    'segmentation': 0.3333
}

# Define sub-weights for demographic metrics
DEMOGRAPHIC_WEIGHTS = {
    'population': 0.15,
    'trade_area_sq_mi': 0.15,
    'trade_area_population_density': 0.1,
    'median_income': 0.1,
    'pct_homeowners': 0.1,
    'pct_white': 0.1,
    'pct_black': 0.1,
    'pct_asian': 0.1,
    'pct_hispanic': 0.1
}

# Define sub-weights for visit metrics
VISIT_WEIGHTS = {'visits_per_sqft_2024': 1}


def calculate_metric_similarity(val1, val2, metric_type):
    """Calculate similarity (0-100) between two values for a given metric type."""
    if pd.isna(val1) or pd.isna(val2):
        return 0
    # For percentage metrics, use the absolute difference in percentage points
    if metric_type in ['pct_white', 'pct_black', 'pct_asian', 'pct_hispanic', 'pct_college_plus']:
        return 100 - abs(val1 - val2)
    # For all other metrics (monetary, population, visit, etc.), use the relative difference
    max_val = max(abs(val1), abs(val2))
    if max_val == 0:
        return 100
    return 100 * (1 - abs(val1 - val2) / max_val)


def calculate_property_similarity(prop1, prop2):
    """Calculate the overall weighted similarity (0-100) between two properties."""
    # Demographic similarity
    demographic_similarity = sum(
        DEMOGRAPHIC_WEIGHTS[metric] * calculate_metric_similarity(prop1[metric], prop2[metric], metric)
        for metric in DEMOGRAPHIC_WEIGHTS
    )
    # Visit metrics similarity
    visit_similarity = sum(
        VISIT_WEIGHTS[metric] * calculate_metric_similarity(prop1[metric], prop2[metric], metric)
        for metric in VISIT_WEIGHTS
    )
    # Segmentation metrics similarity (unweighted average across all seg_ columns)
    seg_metrics = [col for col in df_merged.columns if col.startswith('seg_')]
    seg_similarity = sum(
        calculate_metric_similarity(prop1[metric], prop2[metric], metric)
        for metric in seg_metrics
    ) / len(seg_metrics) if seg_metrics else 0
    # Blend the category scores with the weights defined above
    total_similarity = (
        WEIGHTS['demographic'] * demographic_similarity
        + WEIGHTS['visit'] * visit_similarity
        + WEIGHTS['segmentation'] * seg_similarity
    )
    return round(total_similarity, 2)


# Create a matrix of similarity scores
property_names = df_merged['Property Name'].tolist()
n_properties = len(property_names)
similarity_matrix = np.zeros((n_properties, n_properties))

# Calculate similarity scores for all pairs (the matrix is symmetric)
for i in range(n_properties):
    for j in range(i + 1, n_properties):
        similarity = calculate_property_similarity(df_merged.iloc[i], df_merged.iloc[j])
        similarity_matrix[i, j] = similarity
        similarity_matrix[j, i] = similarity

# Convert to a DataFrame for easier manipulation
similarity_df = pd.DataFrame(similarity_matrix, index=property_names, columns=property_names)

# Create a list of all property pairs with their similarity scores
property_pairs = []
for i in range(n_properties):
    for j in range(i + 1, n_properties):
        property_pairs.append({
            'Property 1': property_names[i],
            'Property 2': property_names[j],
            'Similarity Score': similarity_matrix[i, j]
        })

# Convert to a DataFrame and sort by similarity score
property_pairs_df = pd.DataFrame(property_pairs)
property_pairs_df = property_pairs_df.sort_values('Similarity Score', ascending=False)
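# (Sketch) The per-property top-10 lists mentioned above can be read straight
# off the symmetric similarity matrix; this helper is one possible way to do
# it (not part of the original run), assuming property names are unique:
top_10_similar = {
    prop: similarity_df[prop].drop(prop).nlargest(10)
    for prop in property_names
}

# Re-weighting is a one-line change: to emphasize foot traffic over
# demographics, for example, something like
#     WEIGHTS = {'demographic': 0.25, 'visit': 0.50, 'segmentation': 0.25}
# followed by rebuilding the matrix would do it.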
# Build a styled HTML table for the top property pairs
from IPython.display import HTML, display

# Format the top 10 most similar property pairs
similarity_html = """
<div style='max-width: 800px; margin: 20px auto;'>
  <h3 style='text-align: center; color: #2c3e50; margin-bottom: 15px;'>Most Similar Property Pairs</h3>
  <table style='width: 100%; border-collapse: collapse; font-family: Arial, sans-serif;'>
    <thead>
      <tr style='background-color: #3498db; color: white;'>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property 1</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Property 2</th>
        <th style='padding: 10px; text-align: center; border: 1px solid #ddd;'>Similarity Score</th>
      </tr>
    </thead>
    <tbody>
"""
# Add the top 10 rows; enumerate gives a positional index so the row striping
# alternates correctly (the DataFrame index is shuffled after sorting)
for i, (_, row) in enumerate(property_pairs_df.head(10).iterrows()):
    bg = "#f9f9f9" if i % 2 == 0 else "white"
    similarity_html += f"""
      <tr style='background-color: {bg};'>
        <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{row['Property 1']}</td>
        <td style='padding: 8px; text-align: left; border: 1px solid #ddd;'>{row['Property 2']}</td>
        <td style='padding: 8px; text-align: center; border: 1px solid #ddd;'>{row['Similarity Score']:.2f}</td>
      </tr>
"""
similarity_html += """
    </tbody>
  </table>
</div>
"""
# Display the HTML table
display(HTML(similarity_html))
```

## Thank You for Your Time Reviewing This!

* **Email:** wsigal@uchicago.edu
* **Phone:** 818-268-3572